In 1990 FEL-TNO, in cooperation with Delft University of Technology (TU Delft), started the
real-time SAR (RT-SAR) project, which is funded by TNO and the Ministry of Defense. The result
of the project is the definition of a high-performance RT-SAR processor for defense purposes. The
RT-SAR processor specifications are based on the PHased ARray Universal Sar (PHARUS), which
is currently being developed by FEL-TNO.

The research described in this report concerns an essential operation within RT-SAR processing:
range/azimuth compression. In view of its complexity, this operation is carried out with fast
convolution or correlation algorithms.

The new algorithm developed for this purpose maps a long one-dimensional convolution problem
onto a smaller two-dimensional convolution problem. This is achieved by sectioning the long
convolution into several short convolutions. The resulting architecture makes it possible to realize
an efficient hardware implementation with standard FFT processor elements.

The performance in terms of data rate, input data sequence length and required hardware is very
high compared with the standard configurations in use. The architecture is illustrated with a
configuration suitable for real-time on-board SAR processing for the PHARUS.
ABSTRACT


In this report we propose an algorithm that maps a large one-dimensional linear convolution problem
onto a small two-dimensional linear convolution problem. We show that this property can be derived
from a signal flow representation of a sectioned convolution. The introduction of short length FFT
processor elements into the signal flow graph results in a structure that can be efficiently implemented
in a dedicated hardware architecture. Although the performance of the architecture is not optimal
in terms of computational complexity, the performance in terms of data rate, input data sequence
lengths and amount of required hardware is high. The architecture design is illustrated with a
configuration suitable for real-time on-board airborne synthetic aperture radar (SAR) processing for
the PHased ARray Universal Sar (PHARUS), which is currently under development at TNO-FEL.


SAMENVATTING (SUMMARY)


This report describes an algorithm that maps a long one-dimensional convolution problem onto a
smaller two-dimensional convolution problem. This can be derived from the signal flow
representation of a sectioned convolution problem. The introduction of short FFT processor
elements into the signal flow graph results in a structure that can be implemented efficiently in a
dedicated hardware architecture. The performance in terms of data rate, input data sequence length
and required hardware is high, even though the performance of the architecture is not optimal in
terms of complexity. The architecture is illustrated with a configuration suitable for real-time
on-board synthetic aperture radar (SAR) processing for the PHased ARray Universal Sar (PHARUS),
which is currently being developed by FEL-TNO.
1  INTRODUCTION

In 1990 TNO-FEL started a real-time SAR (RT-SAR) processing project in cooperation with the
Delft University of Technology [3]. The project is funded by TNO and the Ministry of Defense.

The RT-SAR project is an extension of the SAR activities at TNO-FEL. The RT-SAR processor
will be used in combination with the Phased Array Universal SAR (PHARUS), a fully polarimetric
SAR that is currently under development at TNO-FEL.

The defense market increasingly demands high-quality remote sensing images in combination with
real-time processing. The advantage of using SAR for military and civil applications is that the
quality of a SAR image is independent of weather conditions, clouds, daylight and range (distance
between sensor and target). In combination with real-time processing instantaneous anticipation is
possible, which is important in crisis situations. Examples of military applications that require
real-time processing and high-quality images are surveillance and identification, and remotely piloted
vehicles (RPV). Moreover, with an RT-SAR processor it is possible to monitor the SAR system,
which increases the efficiency of an operational mission.

Increasingly complex digital signal processing (DSP) applications can be performed in real-time
with compact hardware due to the developments in DSP hardware and VLSI technology. To serve
the defense market in the future it is necessary to keep in touch with these developments. Therefore
TNO-FEL has defined RT-SAR processing as an application for the development of dedicated
real-time DSP hardware. Moreover, SAR processing is related to applications such as image
processing, inverse SAR (ISAR), interferometry, acoustics and sonar. The RT-SAR technology and
experience can be used to design dedicated DSP hardware for these applications.

The main part of the SAR processing consists of convolutions of the data sequences with a reference
sequence. Since data sequences and reference sequences can have lengths of $O(10^4)$, real-time
SAR processing requires fast convolution algorithms and a dedicated hardware architecture. The
algorithm and architecture described in this report are based on the discrete Fourier transform.

The discrete Fourier transform is of great importance in several digital signal processing applications.
Especially the calculation of convolution, using fast Fourier transform algorithms, has been studied
extensively. Therefore a lot of effort has been devoted to the development of fast convolution
algorithms. Most of them are well-known and summarized in e.g. [4, 13]. However, development
is still going on in this area, as shown in recent papers [7, 10].

In general, algorithms for fast convolution section a large convolution problem into smaller
convolution problems. Well-known examples are the overlap-add and overlap-discard methods, based
on the linearity of the convolution operation. More sophisticated algorithms are based on abstract
algebra, such as the Agarwal-Cooley algorithm [2], which is based on the Chinese remainder theorem.
The idea is that a one-dimensional cyclic convolution problem of length $N = N_1 N_2$ can be mapped
onto a two-dimensional cyclic convolution problem with dimension $N_1 \times N_2$, provided that $N_1$ and
$N_2$ are relatively prime. Methods to solve this two-dimensional cyclic convolution problem are
proposed in [4].

Most of the FFT algorithms emphasize optimum performance in terms of computational complexity
or the mapping on (parallel) general purpose computer architectures. However, in many
applications, such as real-time radar and acoustical signal processing, performance is also assessed in
terms of the amount of hardware and power consumption needed. Moreover, nowadays dedicated VLSI
FFT processors for lengths up to 1024 points complex are available as standard components.

In this report we will describe a fast convolution algorithm that can be implemented in a real-time
dedicated hardware architecture. The prime requirement is that the hardware architecture must
be realized using standard components only, like FFT processors and multipliers. Moreover, the
final system must be compact and suitable for on-board applications. We will assume that the data
sequences have lengths up to $10^4$; however, this can easily be extended.

Our starting point will be the one-dimensional convolution problem. In section 2 we shall describe
convolution using a dependence graph description. In section 3 we show that if the dimension
of our convolution architecture is limited, the convolution problem can be sectioned in time domain
using the overlap-add method. This is described by a signal flow graph mapping of the overlap-add
convolution algorithm. In section 4 we introduce the short length FFT, which is used to
perform short convolution operations within the signal flow graph. The overlap-add convolution
algorithm with short length FFTs is recognized as a linear two-dimensional convolution algorithm
in frequency domain. The dimensions equal the lengths of the short FFTs. Although not optimal in
terms of computational complexity, the structure can be implemented very efficiently in a hardware
architecture. In section 5 we give an example of an architecture, using one FFT processor, a complex
multiplier, work memory and an adder. The architecture has been simulated using specifications of
state-of-the-art standard components to determine the performance in terms of data rate, memory
size and maximum input data sequence length. In section 6 the simulated architecture is illustrated
with a configuration suitable for real-time on-board airborne SAR processing for the PHased ARray
Universal Sar (PHARUS), which is currently under development at TNO-FEL.


ONGERUBRICEERD 


TNO  report 



2  THE ONE-DIMENSIONAL CONVOLUTION PROBLEM


Let $a$ and $x$ be two sequences containing discrete time samples, defined as
$a = \{a_n \mid n = \cdots, -1, 0, 1, \cdots\}$ and $x = \{x_n \mid n = \cdots, -1, 0, 1, \cdots\}$.
Let $r = \{r_n \mid n = \cdots, -1, 0, 1, \cdots\}$ be the result of the convolution of $a$ and $x$,
represented as

$$r = a * x \qquad (2.1)$$

then $r_n$ is given as

$$r_n = \sum_{k=-\infty}^{\infty} a_k x_{n-k} \qquad (2.2)$$

If $a$ and $x$ have finite length $L_a$ and $L_x$, respectively, then $r$ has length $L_a + L_x - 1$. If we
assume that $a_n = 0$ for $n < 0$ and $n \geq L_a$ and $x_n = 0$ for $n < 0$ and $n \geq L_x$, then $r_n$ is given as

$$r_n = \sum_{k=0}^{L_a-1} a_k x_{n-k} \qquad (2.3)$$

The convolution operation can also be described recursively. The recursive form of equation (2.3)
is given as

$$r_n^{(k+1)} = r_n^{(k)} + a_k x_{n-k} \qquad (2.4)$$

where $n = 0, \cdots, L_a + L_x - 2$, $k = 0, \cdots, L_a - 1$ and $r_n^{(0)} = 0$. From equation (2.4) the
dependence graph of figure 2.1.a can be derived [9]. Each node in the graph performs a multiplication
and an addition (figure 2.1.b). The computational complexity of a length $N$ convolution in time
domain is $O(N^2)$.
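The recursive accumulation can be illustrated with a short Python sketch. It is only an illustration of the time-domain recursion, not part of the report; the function name `direct_convolution` and the test vectors are our own assumptions.

```python
import numpy as np

def direct_convolution(a, x):
    """Linear convolution r_n = sum_k a_k x_{n-k}, accumulated term by term.

    Each pass of the outer loop adds the contribution of one coefficient a_k,
    which mirrors the recursive form of the convolution sum.
    """
    La, Lx = len(a), len(x)
    r = np.zeros(La + Lx - 1, dtype=np.result_type(a, x))
    for k in range(La):               # one recursion step per coefficient a_k
        for n in range(k, k + Lx):    # only indices where x_{n-k} is non-zero
            r[n] += a[k] * x[n - k]   # one multiply and one add per node
    return r

# Hypothetical test vectors
a = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, -1.0, 0.5, 2.0])
r = direct_convolution(a, x)
```

The two nested loops perform $L_a L_x$ multiply-add operations, which is the quadratic cost referred to in the text.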

From a computational point of view it is more efficient to transform the sequences to frequency
domain first, where the computational complexity for convolution is much lower, and then perform
an inverse transformation. The transformation that is required is known as the discrete Fourier
transform (DFT) and can be efficiently computed by the fast Fourier transform (FFT) [11, 12].

Figure 2.1: Dependence graph for convolution (a) and the elementary operation (b).

Let $y = \{y_n \mid n = 0, \cdots, N-1\}$ be a finite length sequence, then its DFT
$Y = \{Y_k \mid k = 0, \cdots, N-1\}$ is defined as

$$Y_k = \sum_{n=0}^{N-1} y_n W_N^{nk} \qquad (2.5)$$

where $W_N = e^{-j 2\pi / N}$. The DFT for sequences with length $N$ shall be referred to as $N$ point DFT.

The frequency domain convolution algorithm can be described as follows. Let $a$ and $x$ be two finite
length sequences with lengths $L_a$ and $L_x$, respectively. Let $A$ and $X$ be the $N$ point DFTs of $a$ and
$x$, respectively, with $N = L_a + L_x - 1$. Then $R = \{R_k \mid k = 0, \cdots, N-1\}$, with $R_k = A_k X_k$,
is the $N$ point DFT of $r$. The computational complexity of convolution in frequency domain is
$O(N \log N)$.
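A minimal NumPy sketch of this frequency domain procedure is given below. It assumes zero-padding of both sequences to $N = L_a + L_x - 1$ points; the function name `fft_convolution` and the random test data are illustrative assumptions.

```python
import numpy as np

def fft_convolution(a, x):
    """Linear convolution through N point DFTs with N = La + Lx - 1."""
    N = len(a) + len(x) - 1
    A = np.fft.fft(a, n=N)      # zero-pads a to N points before transforming
    X = np.fft.fft(x, n=N)
    R = A * X                   # element-wise product R_k = A_k X_k
    return np.fft.ifft(R)       # inverse transform gives r

# Hypothetical complex test sequences
rng = np.random.default_rng(0)
a = rng.standard_normal(50) + 1j * rng.standard_normal(50)
x = rng.standard_normal(200) + 1j * rng.standard_normal(200)
r = fft_convolution(a, x)
```

Padding to at least $L_a + L_x - 1$ points is what turns the cyclic convolution implied by the DFT into the desired linear convolution.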
3  SECTIONING OF LARGE CONVOLUTION PROBLEMS


In the previous section we have considered the general description of convolution in time domain and
in frequency domain. However, when $N$ is too large, which can be the case in many applications, the
direct implementations of the time domain or frequency domain algorithms will be inefficient and
impractical, especially when dedicated architectures are used. Therefore we can use overlap-add
or overlap-discard methods to section one or both sequences into smaller subsequences. In this paper
we shall focus on the overlap-add method. For the overlap-discard method we refer to [11, 12].


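Let $a$ and $x$ be defined as in the previous section and assume that we have some convolution
architecture for input sequences of length $L$ only (either in time domain or frequency domain),
with $L < L_a$ and $L < L_x$. Then we can section our convolution problem as follows. Let the
subsequences $a_{l_a}$ and $x_{l_x}$ of $a$ and $x$, respectively, be defined as

$$a_{l_a} = \begin{cases} \{a_{l_a L + i} \mid i = 0, \cdots, L-1\} & \text{if } 0 \leq l_a < L'_a \\ 0 & \text{otherwise} \end{cases}$$

$$x_{l_x} = \begin{cases} \{x_{l_x L + i} \mid i = 0, \cdots, L-1\} & \text{if } 0 \leq l_x < L'_x \\ 0 & \text{otherwise} \end{cases}$$

where $L'_a$ and $L'_x$ denote the number of sections of $a$ and $x$. Substitute

$$k = k_1 L + k_2, \quad k_2 = 0, \cdots, L-1, \qquad n = n_1 L + n_2, \quad n_2 = 0, \cdots, L-1 \qquad (3.1)$$

in equation (2.3), hence

$$r_{n_1 L + n_2} = \sum_{k_1=0}^{L'_a-1} \sum_{k_2=0}^{L-1} a_{k_1 L + k_2}\, x_{n_1 L + n_2 - (k_1 L + k_2)} = \sum_{k_1=0}^{L'_a-1} \sum_{k_2=0}^{L-1} a_{k_1 L + k_2}\, x_{(n_1 - k_1) L + n_2 - k_2} \qquad (3.2)$$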
The interpretation in the dependence graph is as follows. Consider the dependence graph in
figure 3.1, where we have assumed that $L_a$ and $L_x$ are multiples of $L$. Obviously equation (3.2)
can be interpreted as the clustering of $L \cdot L$ nodes. The arithmetic operation within a node is
recognized as a convolution of two subsequences of length $L$.

Figure 3.1: Dependence graph for convolution with clustering of $L \cdot L$ nodes.

Now we can derive a signal flow graph [9] of the dependence graph, where we assume that the
processing element performs the convolution operation within one cluster, see figure 3.2.a. The
delay operator represents a delay of $L - 1$ samples. The processing element is shown in
figure 3.2.b.


Observe that the computational complexity of time domain convolution with overlap-add has
not changed compared to the time domain convolution without overlap-add. To determine the
computational complexity of frequency domain convolution with overlap-add we assume that
the input sequence length is $O(N)$ and the section length is $O(M)$, with $M < N$. Then the
computational complexity is given as $O(\frac{N^2}{M} \log M)$. Notice that overlap-add results in an increase
of computational complexity from $O(N \log N)$ to $O(\frac{N^2}{M} \log M)$.
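One way to arrive at this operation count, given here as our own worked step rather than a derivation taken from the report, is to count the processing elements. With both sequences of length $O(N)$ cut into sections of length $O(M)$, every section of one sequence meets every section of the other, and each processing element performs one frequency domain convolution of length $O(M)$:

$$\underbrace{\frac{N}{M} \cdot \frac{N}{M}}_{\text{processing elements}} \cdot \underbrace{O(M \log M)}_{\text{per element}} = O\!\left(\frac{N^2}{M} \log M\right)$$

For $M = N$ a single element remains and the count reduces to $O(N \log N)$, the cost of the unsectioned frequency domain convolution.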

One stage of the overlap-add method has not been mentioned yet. The subsequences $r_{l_r}$,
$l_r = 0, \cdots, L'_a + L'_x - 2$, must be added. Since each of the subsequences has been delayed
properly within the signal flow graph this can be done in a straightforward way:

$$r = \sum_{l_r=0}^{L'_a + L'_x - 2} r_{l_r}$$

The additional computational complexity is negligible.

Figure 3.2: Signal flow graph derived from the dependence graph (a) and the processing
element (b).
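The sectioning described in this section can be sketched in Python as follows. The sketch assumes, as the text does for the dependence graph, that both sequence lengths are multiples of the section length; `sectioned_convolution` and the test data are hypothetical names and values of ours.

```python
import numpy as np

def sectioned_convolution(a, x, L):
    """Overlap-add convolution with BOTH sequences cut into length-L sections.

    Assumes len(a) and len(x) are multiples of L.
    """
    La_s, Lx_s = len(a) // L, len(x) // L      # number of sections L'_a, L'_x
    a_sec = a.reshape(La_s, L)                  # row la holds a[la*L : (la+1)*L]
    x_sec = x.reshape(Lx_s, L)
    r = np.zeros(len(a) + len(x) - 1, dtype=np.result_type(a, x))
    for la in range(La_s):
        for lx in range(Lx_s):
            # Processing element: short convolution of two length-L sections
            part = np.convolve(a_sec[la], x_sec[lx])    # length 2L - 1
            start = (la + lx) * L                       # delay of this cluster
            r[start:start + 2 * L - 1] += part          # overlapping tails add up
    return r

# Hypothetical test data: L'_a = 2 and L'_x = 3 sections of length L = 4
rng = np.random.default_rng(1)
a = rng.standard_normal(8)
x = rng.standard_normal(12)
r = sectioned_convolution(a, x, L=4)
```

Each partial result is $2L - 1$ samples long while consecutive clusters are only $L$ samples apart, so neighbouring results overlap by $L - 1$ samples; the in-place addition is the "add" step of overlap-add.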
4  SHORT LENGTH FFT PROCESSING


The question that now arises is how to embed the FFT within the overlap-add structure such that
the computational complexity decreases. The straightforward way (as is often implemented in
hardware architectures) is to perform the convolution within a processing element in frequency
domain. This results in a processing element as shown in figure 4.1.

Each subsequence $a_{l_a}$ and $x_{l_x}$ is then transformed $L'_x$ times and $L'_a$ times, respectively.
However, it is more efficient to make use of the linearity of the FFT. Then we can do one transformation
per subsequence at the input side. In the same way we can do one inverse transformation per
subsequence at the output side. This results in the signal flow graph in figure 4.2.a, which is
equivalent to the signal flow graph in figure 3.2.a. The arithmetic operation within a processing
element is now only an element-wise multiplication instead of a convolution. Notice that we
perform the overlap-add in frequency domain.

If we analyze the structure of the signal flow graph of figure 4.2 we recognize that the kernel has
exactly the same structure as the dependence graph (except for the delay operators). Moreover,
since the multiplication operator within the processing element operates element-wise, the complete
kernel as a whole operates element-wise.


Figure 4.1: Processing element of the signal flow graph. The convolution is performed in
frequency domain.

Figure 4.2: Signal flow graph of the overlap-add in frequency domain (a) and the processing
element (b).


Let the sequences $A_{l_a}$, $X_{l_x}$ and $R_{l_r}$ be the FFTs of the subsequences $a_{l_a}$, $x_{l_x}$ and $r_{l_r}$, respectively
(see figure 4.2.a), with the lengths of $A_{l_a}$, $X_{l_x}$ and $R_{l_r}$ at least $2L - 1$. Let $A_{l_a,k}$, $X_{l_x,k}$ and $R_{l_r,k}$,
$k = 0, 1, \cdots$ be the elements of $A_{l_a}$, $X_{l_x}$ and $R_{l_r}$, respectively.

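The kernel can be represented as

$$R_{l_r} = \sum_{l=0}^{L'_a-1} A_l \otimes X_{l_r - l} \qquad (4.1)$$

where $\otimes$ is the element-wise multiplication, and $X_{l_x} = 0$ if $l_x < 0$ or $l_x \geq L'_x$. Equation (4.1) is
equivalent with

$$\left\{ R_{l_r,k} \;\middle|\; R_{l_r,k} = \sum_{l=0}^{L'_a-1} A_{l,k} X_{l_r - l,k},\; k = 0, 1, \cdots \right\} \qquad (4.2)$$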

Obviously, equation (4.1) represents multiple one-dimensional convolutions, as in equation (4.2).
The $k$-th convolution sequences are given as $\{A_{l_a,k} \mid l_a = 0, \cdots, L'_a - 1\}$ and
$\{X_{l_x,k} \mid l_x = 0, \cdots, L'_x - 1\}$ and the result is given as
$\{R_{l_r,k} \mid l_r = 0, \cdots, L'_a + L'_x - 2\}$. Then we can draw the kernel of the signal flow graph
in figure 4.2 as in figure 4.3, where each $k$ slice represents the convolution.

Figure 4.3: Kernel of the signal flow graph of the overlap-add in frequency domain and the
processing element. The slices represent the signal flow graph of a convolution.
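The kernel can be made concrete with a small NumPy sketch: one FFT per input section, a convolution over the section index for every frequency bin $k$, one inverse FFT per output section, and a final overlap-add. The function name `kernel_overlap_add`, the choice of a $2L$ point FFT and the test data are assumptions of this sketch.

```python
import numpy as np

def kernel_overlap_add(a, x, L):
    """Overlap-add with one FFT per section and an element-wise kernel.

    Assumes len(a) and len(x) are multiples of L.
    """
    La_s, Lx_s = len(a) // L, len(x) // L
    n_fft = 2 * L                                  # any length >= 2L - 1 works
    A = np.fft.fft(a.reshape(La_s, L), n=n_fft, axis=1)   # A[l_a, k]
    X = np.fft.fft(x.reshape(Lx_s, L), n=n_fft, axis=1)   # X[l_x, k]

    Lr_s = La_s + Lx_s - 1                         # number of output sections
    R = np.zeros((Lr_s, n_fft), dtype=complex)
    for lr in range(Lr_s):
        for l in range(La_s):
            if 0 <= lr - l < Lx_s:                 # X is zero outside its range
                R[lr] += A[l] * X[lr - l]          # element-wise product per bin k

    r_sec = np.fft.ifft(R, axis=1)                 # one inverse FFT per section
    r = np.zeros((Lr_s + 1) * L, dtype=complex)
    for lr in range(Lr_s):
        r[lr * L: lr * L + n_fft] += r_sec[lr]     # delay by lr*L and add
    return r[:len(a) + len(x) - 1]

# Hypothetical test data
rng = np.random.default_rng(2)
a = rng.standard_normal(12)
x = rng.standard_normal(20)
r = kernel_overlap_add(a, x, L=4)
```

For a fixed bin `k`, the double loop is an ordinary one-dimensional convolution of the column `A[:, k]` with the column `X[:, k]`, which is the slice interpretation used in the text.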


Observe that the kernel of the signal flow graph can be expressed as $2L - 1$ convolution operations
in time domain. Then the next step would be, logically, to perform the convolution operation in
frequency domain. The equivalent frequency domain description of one slice is shown in figure 4.4.
The FFT lengths must be at least $L'_a + L'_x - 1$. The delay operations have been omitted since they
are implied by the geometry. Figure 4.5 gives the complete three-dimensional signal flow graph of
the convolver.


To determine the complexity we shall again assume the length of the input sequences and the
section length to be $O(N)$ and $O(M)$, respectively. The computational complexity of the FFTs of the
subsequences is $O(N \log M)$ and the computational complexity of the kernel FFTs is $O(N \log \frac{N}{M})$.
Hence the computational complexity of the complete convolution algorithm as shown in figure 4.5
is given as $O(N \log M + N \log \frac{N}{M}) = O(N \log N)$.

Figure 4.4: The frequency domain description of one slice from the kernel of the signal flow
graph of the convolution in frequency domain with overlap-add.

The three-dimensional structure of the algorithm in figure 4.5 implies that we are doing some
two-dimensional signal processing. In fact, the overlap-add method is a mapping of a one-dimensional
convolution problem to a two-dimensional convolution problem with the same complexity. Consider
equation (3.2), and let

$$a_{k_1,k_2} \triangleq a_{k_1 L + k_2}, \qquad x_{n_1,n_2} \triangleq x_{n_1 L + n_2}, \qquad r_{n_1,n_2} \triangleq r_{n_1 L + n_2}$$

then equation (3.2) becomes

$$r_{n_1,n_2} = \sum_{k_1=0}^{L'_a-1} \sum_{k_2=0}^{L-1} a_{k_1,k_2}\, x_{n_1 - k_1,\, n_2 - k_2} \qquad (4.3)$$

which is the definition of a two-dimensional linear convolution.

This result has been derived in [1] and developed further in [2, 5]. In [4, 13] the algorithm that solves a
one-dimensional convolution problem by mapping it onto a two-dimensional convolution problem
is summarized, referred to as the Agarwal-Cooley convolution algorithm. One should notice
that the Agarwal-Cooley convolution algorithm is only defined for $N_1 N_2$-point one-dimensional
cyclic convolutions, with $N_1$ and $N_2$ relatively prime. The algorithm that we propose solves a
linear convolution of sequences with arbitrary length with a two-dimensional linear convolution of
dimensions $(2L - 1) \times (L'_a + L'_x - 1)$. The only requirement is that we need two FFTs of at least
$2L - 1$ points and $L'_a + L'_x - 1$ points, respectively.
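The mapping onto a two-dimensional linear convolution can be checked numerically with the sketch below. It reshapes both sequences into two-dimensional arrays, computes their two-dimensional linear convolution with zero-padded two-dimensional FFTs, and then folds the second dimension back with an overlap-add. The folding step and the names `convolve_via_2d`, `a2`, `x2` are our own illustrative choices, and the sequence lengths are assumed to be multiples of $L$.

```python
import numpy as np

def convolve_via_2d(a, x, L):
    """1-D linear convolution computed as a small 2-D linear convolution."""
    La_s, Lx_s = len(a) // L, len(x) // L
    a2 = a.reshape(La_s, L)                 # a2[k1, k2] = a[k1*L + k2]
    x2 = x.reshape(Lx_s, L)                 # x2[n1, n2] = x[n1*L + n2]

    # Size of the 2-D linear convolution: (L'_a + L'_x - 1) x (2L - 1)
    shape = (La_s + Lx_s - 1, 2 * L - 1)
    R2 = np.fft.fft2(a2, s=shape) * np.fft.fft2(x2, s=shape)
    r2 = np.fft.ifft2(R2)                   # r2[n1, n2] with n2 = 0 .. 2L-2

    # Row n1 belongs at offset n1*L of the 1-D result; rows overlap by L-1.
    r = np.zeros(len(a) + len(x) - 1, dtype=complex)
    for n1 in range(shape[0]):
        r[n1 * L: n1 * L + shape[1]] += r2[n1]
    return r

# Hypothetical test data
rng = np.random.default_rng(3)
a = rng.standard_normal(16)
x = rng.standard_normal(40)
r = convolve_via_2d(a, x, L=8)
```

No condition on the two dimensions being relatively prime is needed here, since zero-padding makes both FFT dimensions long enough for a linear rather than a cyclic convolution.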




Figure 4.5: Three-dimensional signal flow graph of a convolution algorithm that consists of
FFTs and multiplications. Instead of long FFTs in one dimension, short FFTs are
used in two dimensions. The polygons represent the FFTs.
5  SIMULATION RESULTS


Algorithm 1 gives a straightforward representation of the convolution algorithm. It is obvious that
this is not optimal in terms of computational complexity. However, the structure of the algorithm
makes it very suitable for implementation in dedicated architectures with short length FFT processor
elements.


Algorithm 1: Convolution Algorithm

STAGE 1   for l_a = 0 : L'_a - 1
              A_{l_a} <- FFT(a_{l_a})
          end
          for l_x = 0 : L'_x - 1
              X_{l_x} <- FFT(x_{l_x})
          end

STAGE 2   for k = 0 : 2(L - 1)
              A'_k <- FFT({A_{l_a,k} | l_a = 0, ..., L'_a - 1})
          end
          for k = 0 : 2(L - 1)
              X'_k <- FFT({X_{l_x,k} | l_x = 0, ..., L'_x - 1})
          end

STAGE 3   for k = 0 : 2(L - 1)
              for l = 0 : L'_a + L'_x - 2
                  R'_{k,l} <- A'_{k,l} X'_{k,l}
              end
              {R_{l_r,k} | l_r = 0, ..., L'_a + L'_x - 2} <- IFFT(R'_k)
          end

STAGE 4   for l_r = 0 : L'_a + L'_x - 2
              r_{l_r} <- IFFT(R_{l_r})
          end

STAGE 5   for l_r = 0 : L'_a + L'_x - 2
              for i = 0 : L - 1
                  r_{l_r L + i} <- r_{l_r,i} + r_{l_r - 1,L + i}
              end
          end
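A runnable NumPy sketch of the five stages is given below. It is an interpretation of the algorithm under stated assumptions, not the report's implementation: the sequence lengths are assumed to be multiples of $L$, the two FFT modes are taken as $2L$ and $L'_a + L'_x$ points, and the name `five_stage_convolution` is hypothetical.

```python
import numpy as np

def five_stage_convolution(a, x, L):
    """Linear convolution using only short FFTs, organised in five stages."""
    La_s, Lx_s = len(a) // L, len(x) // L
    n1 = 2 * L              # FFT mode 1, at least 2L - 1 points
    n2 = La_s + Lx_s        # FFT mode 2, at least L'_a + L'_x - 1 points

    # Stage 1: n1-point FFT of every (zero-padded) section of a and x
    A = np.fft.fft(a.reshape(La_s, L), n=n1, axis=1)
    X = np.fft.fft(x.reshape(Lx_s, L), n=n1, axis=1)

    # Stage 2: n2-point FFT across the section index, one per frequency bin
    A2 = np.fft.fft(A, n=n2, axis=0)
    X2 = np.fft.fft(X, n=n2, axis=0)

    # Stage 3: element-wise multiplication, then inverse FFT across sections
    R = np.fft.ifft(A2 * X2, axis=0)

    # Stage 4: n1-point inverse FFT of every output section
    r_sec = np.fft.ifft(R, axis=1)

    # Stage 5: delay each section by l_r * L samples and add
    r = np.zeros((n2 + 1) * L, dtype=complex)
    for lr in range(n2):
        r[lr * L: lr * L + n1] += r_sec[lr]
    return r[:len(a) + len(x) - 1]

# Hypothetical test data: L = 8, L'_a = 2, L'_x = 6
rng = np.random.default_rng(4)
a = rng.standard_normal(16) + 1j * rng.standard_normal(16)
x = rng.standard_normal(48) + 1j * rng.standard_normal(48)
r = five_stage_convolution(a, x, L=8)
```

Only FFTs of length `n1` and `n2` occur, so a single FFT processor with two modes, a multiplier and an adder are sufficient, at the price of storing the intermediate two-dimensional arrays.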


To illustrate this we will give an example of an efficient hardware architecture. We assume that we
have a complex multiplier and an FFT processor with two modes: a $2L$ point mode and an $L'_a + L'_x$
point mode. Furthermore we have two work memories and some add and delay hardware. The
architecture is shown in figure 5.1.
Figure 5.1: An example of a hardware architecture.

The algorithm has been divided into 5 stages (see algorithm 1). Each stage has its own functionality,
as is shown in table 5.1. The architecture computes the stages 1 to 5 sequentially. In figure 5.2 the
data flow for each stage is shown.


To determine the data rate of the convolution hardware architecture we will assume that $L_a < L_x$.
We further assume that the storage capacity of memory 1 is at least $2L(L'_a + L'_x)$ samples and the
storage capacity of memory 2 is at least $4L(L'_a + L'_x)$ samples. The data rate is defined as

$$\text{data rate} = \frac{\text{no. of data samples to be processed}}{\text{processing time}}$$

Furthermore we define the following parameters:


•  $D_{f1}$: the data rate of the FFT processor in $2L$ point mode

•  $D_{f2}$: the data rate of the FFT processor in $L'_a + L'_x$ point mode

Stage   In       Process                         FFT length     Out      No. of samples
1       input    FFT                             2L             mem. 1   2L(L'_a + L'_x)
2       mem. 1   FFT                             L'_a + L'_x    mem. 2
3       mem. 2   multiply + IFFT (in pipeline)                  mem. 1   2L(L'_a + L'_x)
4       mem. 1   IFFT                            2L             mem. 2   2L(L'_a + L'_x)
5       mem. 2   delay + addition                -              output   L(L'_a + L'_x)

Table 5.1: Functional description of each stage.
Figure 5.2: Data flow of each stage. Observe that stage 5 can be done in parallel with stage 1.

•  $D_m$: the data rate of the complex multiplier

•  $P_f$: no. of FFT processors in parallel

•  $P_m$: no. of complex multipliers in parallel

•  $T_i$, $i = 1, \cdots, 4$: processing time of stage $i$


Then we have

$$T_1 = \frac{2L(L'_a + L'_x)}{P_f D_{f1}}, \qquad T_2 = \frac{4L(L'_a + L'_x)}{P_f D_{f2}}$$

$$T_3 = \frac{2L(L'_a + L'_x)}{\min(P_f D_{f2},\, P_m D_m)}, \qquad T_4 = \frac{2L(L'_a + L'_x)}{P_f D_{f1}}$$

From figure 5.2 we see that stage 5 can be done in parallel with stage 1. This means the architecture
is ready for the next convolution job when stage 4 is finished, so it is not relevant to define $T_5$. The
data rate $D$ of the architecture is given as

$$D = \frac{L_x}{\sum_{i=1}^{4} T_i} \qquad (5.1)$$
 i     L     L'_a + L'_x   D     maximum L_x     Mem 1    Mem 2
 1     8     16                  128 - L_a       256      512
 2     8     64                  512 - L_a       1K       2K
 3     8     256                 2K - L_a        4K       8K
 4     8     1K                  8K - L_a        16K      32K
 5     32    16                  512 - L_a       1K       2K
 6     32    64                  2K - L_a        4K       8K
 7     32    256                 8K - L_a        16K      32K
 8     32    1K                  32K - L_a       64K      128K
 9     128   16                  2K - L_a        4K       8K
10     128   64                  8K - L_a        16K      32K
11     128   256                 32K - L_a       64K      128K
12     128   1K                  128K - L_a      256K     512K
13     512   16                  8K - L_a        16K      32K
14     512   64                  32K - L_a       64K      128K
15     512   256                 128K - L_a      256K     512K
16     512   1K                  512K - L_a      1M       2M

Table 5.2: Performance figures and memory requirements for an architecture with a single-chip
state-of-the-art FFT processor. $L$, $L'_a + L'_x$ and the memory capacity are
represented by no. of complex samples. The data rate $D$ is represented by the no.
of complex samples per second.


If we require $P_m D_m \geq P_f D_{f2}$ then equation (5.1) becomes

$$D = \frac{P_f L_x}{2L(L'_a + L'_x)\,(2 D_{f1}^{-1} + 3 D_{f2}^{-1})} \qquad (5.2)$$

To illustrate the performance of an implementation in hardware with standard components we have
simulated the architecture using the specifications of a single-chip state-of-the-art FFT processor.
The FFT processor has 4 complex modes: 16 point, 64 point, 256 point and 1024 point. The data rates
for these modes are $D_{fi}(16) = D_{fi}(64) = D_{fi}(256) \approx 25$ MSamples/sec and $D_{fi}(1024) \approx 10$
MSamples/sec, $i = 1, 2$. The performance figures and memory requirements are listed in table 5.2.
Observe that the data rate increases linearly with the number of FFT processors in parallel.
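The data rate model can be explored with a small Python sketch. The per-stage sample counts used here are assumptions read from the stage description ($2L(L'_a + L'_x)$ samples for stages 1, 3 and 4, twice that for stage 2, because both transformed inputs pass through the second FFT mode), the data rate counts the $L_x$ input samples of one convolution job, and the multiplier rate `Dm` is a hypothetical value that the text does not specify.

```python
def stage_times(L, n_sections, Df1, Df2, Dm, Pf=1, Pm=1):
    """Processing times T1..T4 in seconds (assumed model, see lead-in).

    L          : section length
    n_sections : L'_a + L'_x
    Df1, Df2   : FFT data rates in 2L point and (L'_a + L'_x) point mode
    Dm         : complex multiplier data rate (hypothetical value)
    Pf, Pm     : number of FFT processors / multipliers in parallel
    """
    block = 2 * L * n_sections
    t1 = block / (Pf * Df1)                   # stage 1: FFTs of all sections
    t2 = 2 * block / (Pf * Df2)               # stage 2: FFTs across sections
    t3 = block / min(Pf * Df2, Pm * Dm)       # stage 3: multiply + IFFT pipeline
    t4 = block / (Pf * Df1)                   # stage 4: IFFTs of output sections
    return t1, t2, t3, t4

def data_rate(Lx, L, n_sections, Df1, Df2, Dm, Pf=1, Pm=1):
    """Input samples per second; stage 5 overlaps with stage 1 of the next job."""
    return Lx / sum(stage_times(L, n_sections, Df1, Df2, Dm, Pf, Pm))

# Example: L = 32 (64 point mode), 256 sections, 25 MSamples/sec in both modes,
# a data sequence of 6144 samples (leaving 2048 samples for the reference).
D1 = data_rate(Lx=6144, L=32, n_sections=256, Df1=25e6, Df2=25e6, Dm=25e6)
D2 = data_rate(Lx=6144, L=32, n_sections=256, Df1=25e6, Df2=25e6, Dm=25e6,
               Pf=2, Pm=2)
```

Under these assumptions `D1` evaluates to about 1.9 MSamples/sec, and doubling both `Pf` and `Pm` doubles the result, in line with the linear scaling noted in the text.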
6  AN IMPLEMENTATION EXAMPLE


In this section we will describe a configuration of a real-time processor architecture for the Phased
Array Universal Synthetic Aperture Radar (PHARUS). PHARUS is a phased array polarimetric
SAR that is currently under development at TNO Physics and Electronics Laboratory. For detailed
specifications we refer to e.g. [6, 8]. We will restrict ourselves to the specifications which are needed
for our configuration and we will not consider phase error correction. The specifications¹ are as
follows:

•  The configuration is determined for one polarimetric direction.

•  The number of looks is 4 with 50% overlap.

•  The resolution is 4 m.

•  The minimum range $R_{min}$ is 7 km.

•  The maximum range $R_{max}$ is 16 km.

•  The sample frequency in range $f_r$ is 100 MHz.

•  The pulse length $\tau_r$ is 19.7 µsec.

•  The sample frequency in azimuth $f_a$ is 292 Hz.

•  The aperture length $\tau_a$ is maximally 4.41 sec.

¹ The specifications have been obtained from PHARUS document no. PH9252FEL.

Range Compression

From the sample frequency in range we can determine the distance between the slant range
samples $dr$ as

$$dr = \frac{c}{2 f_r} \qquad (6.1)$$

where $c$ is the speed of light. This implies that the number of range samples $N_r$ is given as

$$N_r = \frac{R_{max} - R_{min}}{dr} \approx 6 \text{ Ksamples}$$

The processing of the range samples must be performed within $\frac{1}{f_a}$ seconds, which gives the required
data rate of the range compression as

$$D_r = N_r f_a \approx 1.75 \text{ Msamples/sec}$$

The number of samples of the reference sequence $M_r$ is given as

$$M_r = \tau_r f_r \approx 2 \text{ Ksamples}$$

If we compare this with table 5.2 we see the requirements are met with the 7th or 10th row, with
$P_f = 1$. So we can perform real-time range compression for PHARUS with a configuration with
one FFT processor chip in 64 and 256 point mode or 256 and 64 point mode, respectively. For both
configurations we need work memories of 16 Ksamples complex and 32 Ksamples complex. The
data rate for this configuration is 1.8 Msamples/sec.

Azimuth Compression

The maximum number of samples per aperture $M_a$ is given as

$$M_a = \tau_a f_a \approx 1.3 \text{ Ksamples}$$

Since we have 4 looks with 50% overlap the maximum number of aperture samples per look $M'_a$ is
given as

$$M'_a \approx 520 \text{ samples}$$

Let $N_a$ be the number of azimuth samples per batch that is processed. Then, within $\frac{N_a}{f_a}$ seconds, 4
(the number of looks) times $N_r$ azimuth lines of length $N_a$ must be processed (i.e. convolved with
a length $M'_a$ reference sequence). The required data rate $D_a$ of the azimuth compression is thus
given as

$$D_a = 4 N_r f_a \approx 7 \text{ Msamples/sec}$$
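The arithmetic behind these requirements can be retraced with a few lines of Python. The values are those of the specification list; the maximum range of 16 km is our reading of a poorly legible value (it is consistent with the roughly 6 Ksamples of range data), and the slant range spacing $c / (2 f_r)$ is the usual radar relation, assumed here.

```python
C = 299_792_458.0          # speed of light in m/s

f_r = 100e6                # sample frequency in range, Hz
f_a = 292.0                # sample frequency in azimuth, Hz
tau_r = 19.7e-6            # pulse length, s
tau_a = 4.41               # maximum aperture length, s
R_min, R_max = 7e3, 16e3   # minimum and (assumed) maximum range, m
looks = 4

dr = C / (2 * f_r)             # slant range sample spacing, about 1.5 m
N_r = (R_max - R_min) / dr     # range samples per line, about 6 Ksamples
D_r = N_r * f_a                # required range data rate, about 1.75 Msamples/sec
M_r = tau_r * f_r              # range reference length, about 2 Ksamples
M_a = tau_a * f_a              # samples per aperture, about 1.3 Ksamples
D_a = looks * N_r * f_a        # required azimuth data rate, about 7 Msamples/sec

print(f"dr = {dr:.2f} m, N_r = {N_r:.0f}, D_r = {D_r / 1e6:.2f} Msamples/sec")
print(f"M_r = {M_r:.0f}, M_a = {M_a:.0f}, D_a = {D_a / 1e6:.2f} Msamples/sec")
```

The azimuth rate does not depend on the batch length $N_a$: a batch of $N_a$ samples per line contains $4 N_r N_a$ samples and must be finished in $N_a / f_a$ seconds, so $N_a$ cancels.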
If we choose $N_a = 1.5$ Ksamples then the requirements are met with the 3rd or 6th row of table 5.2,
with $P_f = 5$. So we can perform real-time azimuth compression for PHARUS with a configuration
with five FFT processor chips in 16 and 256 point mode or 64 and 64 point mode, respectively. For
both configurations we need work memories of 4 Ksamples complex and 8 Ksamples complex. The
data rate for this configuration is 7.5 Msamples/sec. However, if we choose $N_a = 7.5$ Ksamples
then the requirements are met with the 7th or 10th row, with $P_f = 4$. In this case we need four FFT
processor chips, but the data sequences get longer, which is not always desirable, and the required
work memory sizes increase to 16 Ksamples complex and 32 Ksamples complex.
7  SUMMARY


In this paper we described the mapping of a large one-dimensional linear convolution problem onto
a small two-dimensional linear convolution problem. A dependence graph and signal flow graph
representation has been used to derive this property. By introducing the short length FFT into the
signal flow graph representation we obtained a three-dimensional signal flow graph with short length
FFTs and multiplications only. Although this algorithm is not optimum in terms of computational
complexity, it can be efficiently mapped onto a hardware architecture based on standard components.
The simulation results showed that the architecture performs very well in terms of system data rate,
input data sequence length and amount of required hardware. The architecture is illustrated with
a configuration suitable for real-time on-board airborne SAR processing for the PHased ARray
Universal Sar (PHARUS). This "quick-look" hardware implementation will be the subject of further
work in the near future.